Diabetes Type II Indian dataset project

Group 1

Members

  • Anna Lifousi (s232979)
  • Jordan Sylvester Fernandes (s222497)
  • Manuel Arcieri (s230158)
  • Quim Bech Vilaseca (s233374)
  • Xavier Viñas Margalef (s233532)

Introduction

  • Diabetes is estimated to affect approximately 530 million adults worldwide, with a global prevalence of 10.5 percent among adults aged 20 to 79 years. 1

  • Type 2 diabetes represents approximately 98 percent of global diabetes diagnoses, although this proportion varies widely among

  • Aim: evaluate the factors associated with the onset of type 2 diabetes, to support its control and prevention.

Materials and Methods

Source: https://www.kaggle.com/datasets/mathchi/diabetes-data-set

Data Acquisition and Description

Data Cleaning, Augmentation and Analysis

  • Data Loading: Automatic data loading from server.
  • Data Cleaning: Zero values in several variables were treated as missing (NA). For Insulin, a zero was treated as NA only if the patient was diabetic.
  • Data Augmentation: Columns renamed to improve clarity and new categories created for BMI and Age.
  • Data Description & Analysis: Data summarised statistically and visualised graphically. Principal component analysis (PCA) used to explore the correlation structure between variables.
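
The cleaning and augmentation rules above could be sketched roughly as follows. This is only an illustration: the column names follow the Kaggle CSV, while the set of columns in which a zero counts as missing, the BMI cut-offs (WHO-style classes), the age bins, and the new column names `BMI_Class` and `Age_Group` are all assumptions, since the slides do not specify them. The sample record is made up.

```python
# Sketch of the cleaning and augmentation rules, under stated assumptions.
# Assumed: which columns treat 0 as missing, and all category cut-offs.

ZERO_MEANS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "BMI")


def clean_record(row):
    """Return a copy of row with implausible zeros replaced by None (NA)."""
    cleaned = dict(row)
    for column in ZERO_MEANS_MISSING:
        if cleaned[column] == 0:
            cleaned[column] = None
    # Insulin: a zero becomes NA only for diabetic patients (Outcome == 1).
    if cleaned["Insulin"] == 0 and cleaned["Outcome"] == 1:
        cleaned["Insulin"] = None
    return cleaned


def bmi_category(bmi):
    """Map BMI to a WHO-style class (assumed cut-offs)."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def age_group(age):
    """Map age in years to a coarse group (hypothetical bins)."""
    if age < 30:
        return "<30"
    if age < 50:
        return "30-49"
    return "50+"


def augment_record(row):
    """Add the derived BMI and age categories to a cleaned record."""
    augmented = dict(row)
    augmented["BMI_Class"] = bmi_category(row["BMI"])
    augmented["Age_Group"] = age_group(row["Age"])
    return augmented


# Made-up record for illustration, not a row quoted from the dataset.
sample = {
    "Glucose": 148, "BloodPressure": 0, "SkinThickness": 35,
    "Insulin": 0, "BMI": 33.6, "Age": 50, "Outcome": 1,
}
processed = augment_record(clean_record(sample))
```

Cleaning before augmenting matters here: a BMI of zero becomes NA first, so it is never assigned to the "Underweight" class.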

Results

Correlation heatmap

  • As expected, BMI and skin thickness are strongly correlated, since skin layer thickness increases with BMI.

  • Weak correlation between insulin and glucose levels, possibly because insulin rises as blood glucose rises during the oral glucose tolerance test (OGTT).
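
A correlation heatmap of this kind is built from pairwise correlation coefficients. The sketch below assumes Pearson correlation with pairwise deletion of missing values; the slides do not state which coefficient or missing-value policy was used. The numbers in `toy` are invented for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson r over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations, i.e. the values behind a heatmap."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented toy values; None marks a missing measurement.
toy = {
    "BMI": [22.0, 27.5, 31.0, None, 36.2],
    "SkinThickness": [18.0, 25.0, 30.0, 22.0, 41.0],
    "Insulin": [90.0, None, 120.0, 60.0, 100.0],
}
matrix = correlation_matrix(toy)
```

With pairwise deletion each cell of the matrix may rest on a different number of patients, which is worth keeping in mind when comparing cells for variables with many NA values, such as Insulin.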

Principal component analysis

  • The data was scaled and centred before performing PCA.

  • The first two principal components (PCs) do not clearly separate the two classes.

  • The first two principal components only account for around 50% of the total variance.

  • To explain at least 90% of the total variance, 6 of the 8 PCs would have to be included.
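
The count of components needed for a given share of variance follows from the eigenvalues of the correlation matrix of the scaled and centred data. The sketch below is a minimal standard-library illustration of that calculation, not the project's own code: the Jacobi rotation routine is a stand-in for whatever PCA function was actually used, all function names are hypothetical, and the `rows` values are invented.

```python
import math


def standardize(rows):
    """Centre each column and scale it to unit sample variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes a[p][q] after the rotation.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def components_needed(eigenvalues, threshold):
    """Smallest number of leading PCs whose cumulative share meets threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        running += value / total
        if running >= threshold - 1e-12:
            return k
    return len(eigenvalues)


# Invented toy data: 6 observations of 3 variables.
rows = [
    [2.0, 1.0, 5.0], [3.0, 2.5, 4.0], [4.0, 2.0, 6.5],
    [5.0, 4.5, 3.0], [6.0, 5.0, 7.0], [7.0, 6.5, 5.5],
]
eigenvalues = jacobi_eigenvalues(covariance(standardize(rows)))
explained = [value / sum(eigenvalues) for value in eigenvalues]
n_for_90 = components_needed(eigenvalues, 0.90)
```

Because the variables are scaled to unit variance, the eigenvalues sum to the number of variables, so each component's share is its eigenvalue divided by that count (8 for this dataset).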

Discussion

  • We analysed a dataset containing clinical data on 768 patients, some of whom have diabetes.

  • Although some individual variables are correlated, it is hard to tell whether a patient is diabetic from a subset of features alone.

  • We found a linear correlation between BMI and skin thickness.

  • PCA could not separate diabetic from non-diabetic patients using only the first two PCs.

  • Obesity is a risk factor for diabetes.